Death Penalty in America¶

By Katie Gower and Ekram Milash¶

Executions have been used as punishment for crimes in what is now the United States since before the country's founding. But these punishments have not been applied equally. This report explores how the death penalty has been used in the United States.

Research Question 1: How correlated are death penalty occurrence and death row population in the United States?¶

Death penalty occurrence and death row population have a moderate positive correlation, but neither strictly determines the other. Because of political factors and natural variability, the number of people on death row cannot reliably predict how many people will be executed.

Research Question 2: How do rates of receiving the death penalty differ across races?¶

There are clear disparities between the rate at which the white U.S. population received the death penalty and the rates for other races. Black people were found to be 3.36 times as likely to be given the death penalty as white people, with a strong regression model. Asian, Pacific Islander, and Other people were 1.08 times as likely to be executed as white people, with a moderately strong regression model. Native Americans were found to be 2.55 times as likely to be executed as white people, with a weaker regression model.

Research Question 3: How do U.S. States compare in their cumulative executions and which states are still using the death penalty?¶

The bulk of cumulative executions in the United States have occurred on the East Coast and in the South. In the last 10 years, the death penalty has been used in 20 states, mostly in the South and the Midwest. Five states have gone more than a century without using it, mostly in the Midwest and Northeast.

Modules used: Pandas, GeoPandas, Plotly, and Statsmodels.

Datasets¶

  • U.S. Population Data: Steven Manson, Jonathan Schroeder, David Van Riper, Katherine Knowles, Tracy Kugler, Finn Roberts, and Steven Ruggles. IPUMS National Historical Geographic Information System: Version 19.0 [dataset]. Minneapolis, MN: IPUMS. 2024. http://doi.org/10.18128/D050.V19.0
  • Death Row Population: https://deathpenaltyinfo.org/death-row/overview/size-of-death-row-by-year
  • Death Penalty since 1976 dataset: https://deathpenaltyinfo.org/facts-and-research/data/executions
  • ESPY File - executions from 1608-2002: https://www.icpsr.umich.edu/web/NACJD/studies/8451

Historical Sources:¶

  • About the creation of the ESPY file: https://archives.albany.edu/description/catalog/apap301
  • History of the Death Penalty: https://deathpenaltyinfo.org/resources/high-school/about-the-death-penalty/history-of-the-death-penalty

Data Setting and Methods¶

We had two datasets for execution/death penalty cases. One is the ESPY file, ranging from 1608 to 2002. The other is from the Death Penalty Information Center (deathpenaltyinfo.org), ranging from 1977 to 2025. To combine these, we cleaned them as needed, concatenated the DataFrames, and checked for duplicate values.

The ESPY file is a dataset based on several decades of research done by historian M. Watt Espy. His research was done by finding Department of Corrections records, court proceeding documents, newspapers, and more. It was first published in 1987, and was updated to include data up to 2002. This was a passion project for Espy, and in his work, he uncovered information about botched executions, juvenile executions, and people executed for participating in a slave rebellion.

The Executions Database dataset from the Death Penalty Information Center is compiled from news reports, state Departments of Corrections and the NAACP Legal Defense Fund. It has the benefit of describing a recent, well-documented era, so can be expected to have more accuracy. Still, things can get missed and there's a degree of subjectivity to some of the variables such as race.

For research question 1, data about death row population counts was needed. The Death Penalty Information Center had the necessary data. The only transformations needed were cleaning and casting the year value, renaming the columns, and some minor rearranging. Some of the data compiled in this dataset is credited to the Bureau of Justice Statistics, and some to the NAACP. Data collection techniques have changed over time, and there is variation in the data because it is voluntary for a jurisdiction to allow its data to be collected, though most jurisdictions do allow it.

For research question 2, data about different races' United States populations over time was needed. For this, we used census data from IPUMS-NHGIS, ranging from 1970 to 2020. The largest geography level the NHGIS had to offer was state, so a groupby operation was done to find national totals. This data was sparse, so we used linear interpolation to approximate population values for years without data. It's expected that the population of each race rose steadily within each decade, so it's appropriate to approximate the values in between. It's not perfectly accurate, but it gives more depth to the analysis by allowing it to be done year by year.
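
The decade-to-year interpolation can also be sketched with pandas' built-in `interpolate`; the decade populations below are made-up numbers purely for illustration.

```python
import pandas as pd

# Hypothetical decade populations (in millions) for one race category.
decades = pd.Series([10.0, 12.0], index=[1970, 1980], name='Pop')

# Reindex to one row per year, then linearly interpolate along the index.
yearly = decades.reindex(range(1970, 1981)).interpolate(method='index')
print(yearly.loc[1975])  # 11.0
```

Interpolating on the index (rather than the default row position) is what makes the estimate linear in calendar years.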

In [8]:
"""
Importing all libraries, setting renderer.
"""
import pandas as pd
%pip install pyreadstat
%pip install --upgrade pip
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns
%pip install geopandas as gpd
import numpy as np
%pip install plotly
import plotly
import plotly.express as px
import plotly.graph_objects as go
import statsmodels.api as sm
import doctest
import io
import plotly.io as pio
from scipy.stats import pearsonr
!pip install openpyxl
import kaleido
pio.renderers.default = 'plotly_mimetype+notebook'

Data Loading, Cleaning, and Transformations

In [9]:
"""
Loading in the ESPY file - a dataset running from 1608-2002 containing rows for indivudal 
executions. Renaming columns and subsetting appropiately.
Important columns are the name of the person being executed, year of execution, and race of the 
person.
"""

espy = pd.read_spss("08451-0001-Data.sav")
espy.rename(columns={'V5':'Race', 'V7': 'Name', 'V14': 'Year', 'V16': 'State'}, inplace=True)
espy = espy.loc[:, ["Name", "Year", "State", "Race"]]
In [11]:
def reformat_name(name : str) -> str:
    """
    Takes strings in the format "LAST FIRST MIDDLE", then returns strings in the format 
    "First Middle Last", without the middle name if the original string didn't have one.

    >>> reformat_name("AA BB CC")
    'Bb Cc Aa'
    >>> reformat_name("GOWER KATIE")
    'Katie Gower'
    >>> reformat_name("GOWER KATIE ANNE")
    'Katie Anne Gower'
    >>> reformat_name("WISER THE IDLER WHEEL IS")
    'The Idler Wheel Is Wiser'
    """
    names = name.split()
    formatted_name = ""
    for name in names[1:]:
        formatted_name = formatted_name + name.capitalize() + " "
    if len(names) != 0:
        formatted_name = formatted_name + names[0].capitalize()
    return formatted_name
In [12]:
espy["Name"] = espy["Name"].apply(reformat_name)

doctest.testmod()

"""
Loading the dataset from the Death Penalty Information Center (deathpenaltyinfo.org),
which has death penalty data by case from 1977 to 2025.
Creating new columns for date and name to normalize with the ESPY dataset.
Subsetting to only include important columns: Name, Year, State, and Race.
"""
death_penalty_info = pd.read_csv("executions.csv")
death_penalty_info["Year"] = death_penalty_info["Execution Date"].str[-4:]
death_penalty_info["Name"] = death_penalty_info["First Name"] + " " + death_penalty_info["Last Name"]
death_penalty_info = death_penalty_info.loc[:, ["Name", "Year", "State", "Race"]]
In [13]:
"""
Concatenating the two DataFrames to create one and sorting by year.
Dropping all duplicate rows in the dataset.
"""
all_executions = pd.concat([espy, death_penalty_info])
all_executions = all_executions.astype({'Year' : int})
all_executions.sort_values(by = "Year", inplace = True)

def drop_duplicates(df : pd.DataFrame) -> pd.DataFrame:
  """
  Takes a DataFrame.
  Removes all duplicates of rows with the same name and year value, leaving one copy.
  Does not drop duplicates of rows with no name value because they could be distinct.
  Returns the DataFrame with duplicates dropped.
  """
  rows_with_names = df[df['Name'] != '']
  duplicates = rows_with_names[rows_with_names.duplicated(['Name', 'Year'], keep = 'first')].index
  return df.drop(duplicates).reset_index().drop(columns = ['index'])
all_executions = drop_duplicates(all_executions)

assert(drop_duplicates(test_df_duplicates1).equals(duplicates1_return))
assert(drop_duplicates(test_df_duplicates2).equals(duplicates2_return))
In [14]:
"""
Loading the death row population data. 
Transforming to be sorted and have correct column names.
Doing a minor cleaning operation on certain Year values.
"""
death_row = pd.read_excel('Size-of-Death-Row-(1968-–-present).xlsx')
death_row.at[55,1968] = 1968
death_row.at[55,517] = 517
death_row = death_row.rename(columns = {1968: 'Year', 517: 'Population'})
death_row.loc[48:54, 'Year'] = death_row.loc[48:54, 'Year'].str[:-1]
death_row['Year'] = death_row['Year'].apply(pd.to_numeric)

"""
Creating the DataFrame of U.S. population by race  by loading the NHGIS census data.
Subsetting the dataframe, renaming columns, and using groupby() to get the data in an 
analogous layout to the executions_by_race dataframe that will later be created.
"""
us_pop = pd.read_csv('nhgis0002_ts_nominal_state.csv')
us_pop = us_pop.loc[:, ['YEAR', 'B18AA', 'B18AB', 'B18AC', 'B18AD']]
us_pop.rename(columns = {'B18AA' : 'White',
        'B18AB' : 'Black',
        'B18AC' : 'American Indian and Alaska Native',
        'B18AD' : 'Asian and Pacific Islander and Other Race'}, inplace = True)
us_pop = us_pop.groupby('YEAR').sum()

Results¶

Research Question 1: How correlated are death penalty occurence and death row population in the United States?¶

In [15]:
"""
Performing linear regression on the relationship between death penalties and the death row population.
"""
execution_count = pd.DataFrame(all_executions.groupby('Year').size().loc[1968:2023])
execution_count.columns = ['Executions']
execution_count = execution_count.reset_index()

executions_and_death_row = execution_count.merge(death_row, on='Year')

x = executions_and_death_row['Population']
y = executions_and_death_row['Executions']
x_const = sm.add_constant(x)

the_model = sm.OLS(y, x_const)
results = the_model.fit()

correlation, pvalue = pearsonr(x, y)
print(f"\nthe Pearson correlation of the graph: {correlation:.3f} (p = {pvalue:.3f})")

""" Plotting the relationship between the execution and death row. """
sns.regplot(x=x, y=y, line_kws={'color': 'black'})
plt.title('Death Row Population vs Executions')
plt.xlabel('Death Row Population')
plt.ylabel('Death Penalties')
plt.tight_layout()
plt.show()

""" Plotting the difference between the predicted value and observed value. """
residuals_left = y - results.predict(x_const)
plt.figure(figsize=(8, 4))
sns.residplot(x = executions_and_death_row['Year'], y = residuals_left, lowess = True, \
            line_kws = {"color": "pink"})
plt.title("Plotted Residuals")
plt.xlabel("Year")
plt.ylabel("Residuals")
plt.show()
the Pearson correlation of the graph: 0.703 (p = 0.000)
(Figure: Death Row Population vs Executions, scatter with regression line)
(Figure: Plotted residuals by year)

The model had a Pearson correlation of 0.703. This tells us that death row population and death penalties have a positive correlation, but not a very strong relationship. This is expected and can be hypothesized to stem from many factors. An execution actually being carried out is a rare event, so each individual case can sway a given year's count by a large amount. Also, political events and trends influence how often and how quickly people are given the death penalty. Looking at the graphs, we can see a period around the year 2000 when the death penalty was applied more often relative to the death row population, while in other periods it was used less often than the model predicts.

Research Question 2: How do rates of receiving the death penalty differ across races?¶

Because the population dataset only had one value per race per decade, linear interpolation was used to approximate population counts for the years in between. This was beneficial because, for each race, a rate of death penalties per 1 million U.S. population was calculated. The interpolation allowed those rates to be calculated for every year between 1970 and 2020, for each race category.

The NHGIS population data only had four racial categories: 'White', 'Black', 'American Indian and Alaska Native', and 'Asian and Pacific Islander and Other Race'. This limited the analysis because two of the categories, especially the last, group several races under one label, so they cannot be told apart in the data. If somebody wished to learn something about Latino people who weren't white or Black, or anybody else who falls under the 'Other Race' label, they wouldn't be able to learn about their specific experience.

In [16]:
def interpolate_population(df, start_yr : int, end_yr : int) -> pd.DataFrame:
  """
  Takes a DataFrame indexed by decade year with population columns for several races.
  Linearly interpolates between decades to estimate each race's population for every
  year from start_yr to end_yr, inclusive.
  Returns a DataFrame indexed by year with a '<Race> Pop' column for each race.
  """
  index_year = []
  population_values = {'White Pop': [], 'Black Pop': [], 'American Indian and Alaska Native Pop': [], \
                     'Asian and Pacific Islander and Other Race Pop': []}
  for year in range(start_yr, end_yr + 1):
    for race in population_values.keys():
      rate = df.loc[(year//10) * 10, race[:-4]]
      if year != end_yr:
        rate += (df.loc[(year//10)*10 + 10, race[:-4]] - df.loc[(year // 10)*10, race[:-4]]) * ((year%10)/10)
      population_values[race].append(rate)
  return pd.DataFrame(population_values, index = range(start_yr, end_yr + 1))

population_data = interpolate_population(us_pop, 1970, 2020)

assert(interpolate_population(test_df_interpolate1, 1970, 1990).equals(interpolate1_return))

The death penalty datasets and the U.S. population dataset have different labels for races, so we needed to decide which execution labels map onto which population category. The U.S. population dataset has Asian, Pacific Islander, and Other Race all in one category and has no data for Hispanic or Latino people. The source's codebook had no mention of how those people were labeled, so they were treated as 'Other Race' and put in the Asian, Pacific Islander, and Other Race category.

In [17]:
"""
Recreating the DataFrame so the data can be compared to the population data.
"""
def df_count_and_recategorize(df : pd.DataFrame, start_year, end_year) -> pd.DataFrame:
  """
  Takes a DataFrame with rows of individual events and columns for year and race.
  Only counts rows whose race is one of: 'White', 'Black', 'Asian',
  'Asian-Pacific Il', 'Other', 'Other Race', 'Latino/a', 'Hispanic',
  'Native American', and 'American Indian or Alaska Native'; rows with other
  labels are ignored.
  Returns a DataFrame of every year's number of events for each category.
  """
  years = range(start_year, end_year + 1)
  races = pd.DataFrame()
  races['White'] = df[df['Race'] == 'White'].groupby('Year').size().reindex(years, fill_value = 0)
  races['Black'] = df[df['Race'] == 'Black'].groupby('Year').size().reindex(years, fill_value = 0)
  races['Asian and Pacific Islander and Other Race'] = df[df['Race'].isin( \
    ['Asian', 'Asian-Pacific Il', 'Other', 'Other Race', 'Hispanic', 'Latino/a'])] \
    .groupby('Year').size().reindex(years, fill_value = 0)
  races['Native American'] = df[df['Race'].isin( \
    ['Native American', 'American Indian or Alaska Native'])] \
    .groupby('Year').size().reindex(years, fill_value = 0)
  races = races.fillna(0)
  return races.reset_index()

all_executions_subset = all_executions
executions_by_race = df_count_and_recategorize(all_executions_subset, 1608, 2025)
population_data_frame = population_data.reset_index().rename(columns = {'index' : 'Year'})

assert(df_count_and_recategorize(test_recategorize, 2000, 2003).equals(recategorize_return))

"""
Creating a plot of executions over time for each race from 1608-2025.
"""
count_subset = executions_by_race.loc[:,['Year','White', 'Black', 'Asian and Pacific Islander and Other Race', 'Native American']].sort_values(by = 'Year')
fig4 = px.line(count_subset,
            x = 'Year',
            y = ['White', 'Black', 'Asian and Pacific Islander and Other Race', 'Native American'],
            labels = {'Count Per 1M' : 'Execution Count Per 1M', 'variable' : 'Race', 'value' : 'Executions'},
            title = "Executions Over Time",
            template = 'plotly_white')
fig4.show()

The graph above charts the number of people of each race who were killed by the death penalty in a given year.

In 1972, the Supreme Court ruled in Furman v. Georgia that statutes detailing the use of the death penalty fell under the definition of 'cruel and unusual' and were thereby unconstitutional. This ruling did not declare the death penalty itself unconstitutional, only all existing statutes legislating it. States could therefore rewrite their death penalty statutes, and in 1976 these reforms were approved by the court. This is why there's a period of 0 executions on this graph.

The lines for white and Black people are startlingly similar. It appears that throughout the history of the United States, white and Black people have been given the death penalty roughly the same number of times. But there have always been more white people in the U.S. than Black people, so the two groups must be punished at different rates. This is explored in the next graph.

Because the IPUMS-NHGIS population data starts in 1970, the execution rates when controlled for population can only be charted from 1970 to 2020.

To control for population, a death-penalties-per-1-million-people value was calculated for each race over time. This is done by taking the number of executions of people of that race, dividing it by the U.S. population of that race, and multiplying by a million. The ratio is given per 1 million people because the death penalty is a very rare event, so the per-person ratio would be a very small value that's hard to wrap your head around.
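
As a worked example with made-up figures: if 18 executions happened in a year to a group with a population of 22.6 million, the rate comes out to well under one per million.

```python
# Hypothetical figures for illustration only.
executions = 18
population = 22_600_000

rate_per_1m = executions / population * 1_000_000
print(round(rate_per_1m, 3))  # 0.796
```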

In [18]:
"""
Creating columns for each race that are of executions per million values.
"""
executions_by_race = executions_by_race.merge(population_data_frame, on = 'Year')

def create_per_1m_columns(df: pd.DataFrame) -> pd.DataFrame:
  """
    Takes a DataFrame with the columns: 'Year', 'White', 'Black', 'Asian and Pacific Islander and Other Race',
       'Native American', 'White Pop', 'Black Pop', 'American Indian and Alaska Native Pop', and
       'Asian and Pacific Islander and Other Race Pop'.
    Returns the same DataFrame but with 'Per 1M' columns that show the proportion between count of how
    many times the event has happened to someone of that race per 1 million people of that race.
  """
  df['White Per 1M'] = (df['White'] * 1000000.0)/df['White Pop']
  df['Black Per 1M'] = (df['Black'] * 1000000.0)/df['Black Pop']
  df['Native Per 1M'] = (df['Native American'] * 1000000.0)\
    /df['American Indian and Alaska Native Pop']
  df['Asian and Pacific Islander and Other Per 1M'] = (df\
    ['Asian and Pacific Islander and Other Race'] * 1000000.0)/df['Asian and Pacific Islander and Other Race Pop']
  return df

executions_by_race_per_1m = create_per_1m_columns(executions_by_race[(executions_by_race['Year'] >= 1970) & (executions_by_race['Year'] <= 2020)].copy())

assert(create_per_1m_columns(test_1m_columns1).equals(test_1m_columns1_return))

"""
Doing a melt operation to transform the DataFrame into one that has a row for each Year and Race combination.
"""
execution_rates = executions_by_race_per_1m[['Year', 'White Per 1M', 'Black Per 1M', 'Native Per 1M', \
  'Asian and Pacific Islander and Other Per 1M']] \
       .rename(columns = {'White Per 1M' : 'White',
        'Black Per 1M' : 'Black',
        'Asian and Pacific Islander and Other Per 1M' : 'Asian and Pacific Islander and Other Race',
        'Native Per 1M' : 'Native American'} \
        ).melt('Year', var_name='Race', value_name='Count Per 1M')
execution_rates = execution_rates.sort_values(by = 'Year')

"""
Plotting the execution rate of each race.
"""
fig3 = px.line(execution_rates,
            x = 'Year',
            y = 'Count Per 1M',
            color = 'Race',
            labels = {'Count Per 1M' : 'Execution Count Per 1M'},
            title = "Death Penalty Counts Over Time, Controlled for Population",
            template = 'plotly_white')
fig3.show()

Now, we can see that execution rates for Black and Native people are much higher than for white people. The Asian, Pacific Islander, and Other Race category plotted at around the same rate as white people, but slightly higher in some years.

According to the NHGIS U.S. population data, the Native American population was 792 thousand in 1970 and grew to 3.7 million by 2020. Because this population was so small and the death penalty happens to very few people, there were many years in this time frame when no Native American people were executed, even when the penalty was being used in high numbers; this can be seen where the purple line sits at 0 for much of the graph above. It is of course good that the rate of execution wasn't high enough to affect a relatively small population every year, but it may skew the result of the linear regression model. If the United States had a much higher population but the proportions between race and death penalty rates stayed the same, these 0 values would probably not exist. We decided to keep the 0 values because it seemed even less accurate to draw conclusions from only some of the values.

In [19]:
"""
Plotting a linear regression model and a scatterplot of each race's rate of executions
compared to white people's rate of executions.
"""
executions_by_race_per_1m.sort_values(by = 'White Per 1M', inplace = True)
race_columns = ['Black Per 1M', 'Native Per 1M', 'Asian and Pacific Islander and Other Per 1M']
for col in race_columns:
  x = executions_by_race_per_1m[['White Per 1M']]
  y = executions_by_race_per_1m[col]

  model = sm.OLS(y, x)
  results = model.fit()
  fig3 = px.scatter(executions_by_race_per_1m, 
                    'White Per 1M', 
                    col, 
                    trendline = 'ols')
  fig3.update_layout(yaxis_title = 'Death Penalties Per 1M ' + col[:-7] + " People",
                    xaxis_title = 'Death Penalties Per 1M White People',
                    title = 'Rate of ' + col[:-7] + ' People vs. White People Receiving the Death Penalty')
  fig3.show()
  print(results.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:           Black Per 1M   R-squared (uncentered):                   0.910
Model:                            OLS   Adj. R-squared (uncentered):              0.909
Method:                 Least Squares   F-statistic:                              508.0
Date:                Sun, 23 Nov 2025   Prob (F-statistic):                    7.55e-28
Time:                        21:21:50   Log-Likelihood:                          20.824
No. Observations:                  51   AIC:                                     -39.65
Df Residuals:                      50   BIC:                                     -37.72
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
White Per 1M     3.3613      0.149     22.539      0.000       3.062       3.661
==============================================================================
Omnibus:                       12.610   Durbin-Watson:                   2.478
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               35.697
Skew:                          -0.368   Prob(JB):                     1.77e-08
Kurtosis:                       7.032   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:          Native Per 1M   R-squared (uncentered):                   0.582
Model:                            OLS   Adj. R-squared (uncentered):              0.573
Method:                 Least Squares   F-statistic:                              69.50
Date:                Sun, 23 Nov 2025   Prob (F-statistic):                    5.04e-11
Time:                        21:21:51   Log-Likelihood:                         -15.749
No. Observations:                  51   AIC:                                      33.50
Df Residuals:                      50   BIC:                                      35.43
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
White Per 1M     2.5469      0.306      8.337      0.000       1.933       3.160
==============================================================================
Omnibus:                       15.882   Durbin-Watson:                   2.156
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               22.139
Skew:                           1.042   Prob(JB):                     1.56e-05
Kurtosis:                       5.464   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                                             OLS Regression Results                                             
================================================================================================================
Dep. Variable:     Asian and Pacific Islander and Other Per 1M   R-squared (uncentered):                   0.806
Model:                                                     OLS   Adj. R-squared (uncentered):              0.802
Method:                                          Least Squares   F-statistic:                              207.9
Date:                                         Sun, 23 Nov 2025   Prob (F-statistic):                    1.92e-19
Time:                                                 21:21:51   Log-Likelihood:                          55.987
No. Observations:                                           51   AIC:                                     -110.0
Df Residuals:                                               50   BIC:                                     -108.0
Df Model:                                                    1                                                  
Covariance Type:                                     nonrobust                                                  
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
White Per 1M     1.0791      0.075     14.419      0.000       0.929       1.229
==============================================================================
Omnibus:                       25.916   Durbin-Watson:                   1.653
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               54.300
Skew:                           1.463   Prob(JB):                     1.62e-12
Kurtosis:                       7.122   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

When controlling for population, the death penalty was found to be given to Black people 3.36 times as often as to white people, with a high R-squared value of 0.910. This strong correlation isn't surprising. Given the United States' history of slavery and the ongoing prejudice against Black people, it's expected that Black people are disproportionately given the death penalty because of racism.

The regression model found that the death penalty has been given to Native Americans 2.55 times as often as to white people, but with an R-squared value of only 0.582. This is because many 0 values were in the dataset, and it is unclear whether those values accurately reflect the rate of receiving the death penalty. These 0 values skew the line downwards, so in reality the ratio could be even higher than 2.55.

The model found Asian, Pacific Islander, and Other people to receive the penalty 1.08 times as often as white people, with an R-squared value of 0.806. Even with a decent R-squared value, it's hard to make much of this result because the race category is so broad. There isn't a lot to be learned when several races are combined into one value.

Research Question 3: How do U.S. States compare in their cumulative executions and which states are still using the death penalty?¶

The first plot for this question requires a DataFrame of cumulative numbers of executions per state per year. To create this, we recreated the dataset to have counts of executions per state per year, reindexed it to have a row for every year, filled the empty values with 0s, and then ran the cumulative sum function on the dataframe.

Plotly's built-in U.S. geodata requires two-letter state codes such as 'WA'. So, the DataFrame was merged with a DataFrame mapping each state's name to its state code.

In [21]:
"""
Subsetting to only have relevant information. Transforming the DataFrame from
one of individual events to one counting events per year per state.
"""
all_executions["Year"] = pd.to_numeric(all_executions["Year"])
all_executions_subset = all_executions[['Year', 'State']]

def cumulative_executions_by_state(df: pd.DataFrame, state_codes: pd.DataFrame) -> pd.DataFrame:
  """
  Takes a DataFrame with rows of individual executions and columns for year and
  state, plus a DataFrame mapping state names to two-letter state codes.
  Returns a DataFrame of each state's cumulative number of executions per year.
  """
  state_and_count = {}

  df_states = set(df['State'])
  for state in set(state_codes['State']):
    if state in df_states:
      state_and_count[state] = df[df['State'] == state].groupby('Year').size()
  df2 = pd.DataFrame(state_and_count).fillna(0)
  idx = range(df['Year'].min(), df['Year'].max() + 1)
  df2 = df2.reindex(idx, fill_value=0).cumsum()
  df2['Year'] = df2.index
  df2 = df2.melt(id_vars='Year', var_name='State',
                 value_name='Cumulative Executions').sort_values(by='Year')
  return df2.merge(state_codes, on='State')

all_executions_mapping = cumulative_executions_by_state(all_executions_subset, state_codes)

#assert(cumulative_executions_by_state(cumulative_executions_by_state_test, state_codes).equals(cumulative_executions_by_state_return))


"""
Creating a slider choropleth map of cumulative executions for each state.
"""
fig = px.choropleth(all_executions_mapping,
                    locationmode="USA-states",
                    locations = 'Code',
                    color = 'Cumulative Executions',
                    color_continuous_scale = "Blackbody",
                    animation_frame = 'Year',
                    range_color = (0, 1500),
                    scope = 'usa')
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 0.00625
fig.layout.updatemenus[0].buttons[0].args[1]['transition']['duration'] = 0.00625
fig.show()

This map shows a clear regional trend: the bulk of executions have happened in southern and East Coast states. By 2025, Texas has the most cumulative executions, and almost half of them have come in the last 40 years: 767 cumulatively in 1985 versus 1,483 in 2025. One hypothesis for this regional pattern is the historical significance of slavery in these states, combined with ongoing conservative government and widespread poverty.

There's a flaw in this visual: it seems to show that most executions in the United States have happened in the last 150 years. That is not necessarily the case. Because evidence from centuries ago has been lost, the dataset may severely underreport the number of early executions.

The Present Day¶

Now, we look at which states are still using the death penalty.

In [22]:
"""
Using a groupby to find the most recent year of the death penalty being used in each state,
and creating a bar plot of it.
"""
years_since_last = pd.DataFrame((2025 - all_executions_subset.groupby('State')['Year'].max()).sort_values())

bar_graph = go.Bar(y = years_since_last.index,
           x = years_since_last['Year'],
           orientation = 'h')
fig2 = go.Figure(bar_graph)
fig2.update_layout(yaxis = dict(autorange = "reversed"),
                height = 1200,
                template = 'plotly_white',
                xaxis_title = 'Years Since Last Use of the Death Penalty')
fig2.show()

These results show that the death penalty is still a very present carceral force in the United States.

Alabama, Arizona, Indiana, Florida, Louisiana, Texas, South Carolina, Oklahoma, and Tennessee have all used the death penalty in 2025. Georgia, Missouri, and Utah last used it in 2024. 20 states have used the death penalty in the past decade. Many of those states are in the South, matching the cumulative peaks, but many are in the Midwest or other regions. Only 5 states have gone more than a century since using it, and they are mostly located in the Midwest and Northeast.

Implications and Limitations¶

Flaws in the datasets have affected the results. The ESPY file is especially likely to be troubled. Because most of its entries are hundreds of years old, it's hard to know whether they represent all state executions carried out in a given year or only a fraction of them. Counts from the early tail of that dataset can't be taken as fact; values computed from it most likely become more accurate as the year approaches the present day.

Some variables, such as the severity of the crime and other causal factors, are not represented in the analysis. Because the datasets do not control for them, their omission may bias the results.

The analysis of the death penalty across states also has a limitation: it is hard to compare how often a state has used the death penalty in recent decades versus historically, or whether its use has changed over time. There is likewise no analysis of whether the racial makeup of death penalty victims changes over time.

Another limiting factor was that the population data used was sparse. Execution rates could only be studied for a few decades and were based largely on approximations of the U.S. population. With more detailed data, the analysis could reveal a broader trend.
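To work around the sparseness, the decennial census counts were expanded to yearly estimates by linear interpolation, as the interpolate_population() test fixtures below encode. A minimal sketch of that step with made-up counts (not real census figures):

```python
import pandas as pd

# Made-up decennial counts for one group.
census = pd.DataFrame({'Black Pop': [2000.0, 3000.0, 5000.0]},
                      index=pd.Index([1970, 1980, 1990], name='Year'))

# Reindex to every year in the range, then fill the gaps linearly.
yearly = census.reindex(range(1970, 1991)).interpolate(method='linear')

print(yearly.loc[1975, 'Black Pop'])  # 2500.0, halfway between the censuses
```

Linear interpolation assumes steady growth between censuses, which is exactly the kind of approximation that limits the rate estimates above.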

People should be aware of the significant disparity in how often people of different races are given the death penalty, especially between white and Black people. The analysis should be used to find patterns of systemic injustice and to understand what continues to happen to certain demographics. It should not be used to make claims about who Black and Native people are as people or cultures. Higher incidence of a punishment, in this case, says nothing about the people receiving the punishment; it is instead indicative of a racist country continuing racist practices.

In [23]:
""" State codes DataFrame for mapping from state name to state code. """
state_codes = pd.read_csv(io.StringIO("""
State,Code
Mississippi,MS
North Carolina,NC
Oklahoma,OK
Virginia,VA
West Virginia,WV
Louisiana,LA
Michigan,MI
Massachusetts,MA
Idaho,ID
Florida,FL
Nebraska,NE
Washington,WA
New Mexico,NM
Puerto Rico,PR
South Dakota,SD
Texas,TX
California,CA
Alabama,AL
Georgia,GA
Pennsylvania,PA
Missouri,MO
Colorado,CO
Utah,UT
Tennessee,TN
Wyoming,WY
New York,NY
Kansas,KS
Alaska,AK
Nevada,NV
Illinois,IL
Vermont,VT
Montana,MT
Iowa,IA
South Carolina,SC
New Hampshire,NH
Arizona,AZ
District of Columbia,DC
American Samoa,AS
United States Virgin Islands,VI
New Jersey,NJ
Maryland,MD
Maine,ME
Hawaii,HI
Delaware,DE
Guam,GU
Commonwealth of the Northern Mariana Islands,MP
Rhode Island,RI
Kentucky,KY
Ohio,OH
Wisconsin,WI
Oregon,OR
North Dakota,ND
Arkansas,AR
Indiana,IN
Minnesota,MN
Connecticut,CT
"""))

#DataFrames for testing drop_duplicates()
test_df_duplicates1 = pd.read_csv(io.StringIO("""
Name,Year,Month,Day,State,Race,Sex,Age
A,2022,3,9,Washington,White,Female,25
A,2022,3,9,Washington,White,Male,30
B,2022,3,9,Washington,White,Male,25
B,2021,3,9,Washington,White,Male,25
"""))

duplicates1_return = pd.read_csv(io.StringIO("""
Name,Year,Month,Day,State,Race,Sex,Age
A,2022,3,9,Washington,White,Female,25
B,2022,3,9,Washington,White,Male,25
B,2021,3,9,Washington,White,Male,25
"""))

test_df_duplicates2 = pd.read_csv(io.StringIO("""
Name,Year,Month,Day,State,Race,Sex,Age
A,2022,3,9,Washington,White,Male,20
A,2022,03,9,Washington,White,Male,20
"""))

duplicates2_return = pd.read_csv(io.StringIO("""
Name,Year,Month,Day,State,Race,Sex,Age
A,2022,3,9,Washington,White,Male,20
"""))

#DataFrames for testing interpolate_population()
test_df_interpolate1 = pd.read_csv(io.StringIO("""
Year,White,Black,American Indian and Alaska Native,Asian and Pacific Islander and Other Race
1970,1,2000,1,1
1980,11,3000,1,1
1990,21,5000,1,1
""")).set_index('Year')

interpolate1_return = pd.read_csv(io.StringIO("""
Year,White Pop,Black Pop,American Indian and Alaska Native Pop,Asian and Pacific Islander and Other Race Pop
1970,1.0,2000.0,1.0,1.0
1971,2.0,2100.0,1.0,1.0
1972,3.0,2200.0,1.0,1.0
1973,4.0,2300.0,1.0,1.0
1974,5.0,2400.0,1.0,1.0
1975,6.0,2500.0,1.0,1.0
1976,7.0,2600.0,1.0,1.0
1977,8.0,2700.0,1.0,1.0
1978,9.0,2800.0,1.0,1.0
1979,10.0,2900.0,1.0,1.0
1980,11.0,3000.0,1.0,1.0
1981,12.0,3200.0,1.0,1.0
1982,13.0,3400.0,1.0,1.0
1983,14.0,3600.0,1.0,1.0
1984,15.0,3800.0,1.0,1.0
1985,16.0,4000.0,1.0,1.0
1986,17.0,4200.0,1.0,1.0
1987,18.0,4400.0,1.0,1.0
1988,19.0,4600.0,1.0,1.0
1989,20.0,4800.0,1.0,1.0
1990,21.0,5000.0,1.0,1.0
""")).set_index('Year')

#DataFrames for testing df_count_and_recategorize()
test_recategorize = pd.read_csv(io.StringIO("""
Year,Race
2000,White
2000,White
2000,Black
2000,Hispanic
2000,Asian
2001,Black
2001,Other
2001,American Indian or Alaska Native
2001,European
2002,White
"""))

recategorize_return = pd.read_csv(io.StringIO("""
Year,White,Black,Asian and Pacific Islander and Other Race,Native American
2000,2,1,2,0
2001,0,1,1,1
2002,1,0,0,0
2003,0,0,0,0
"""))

#DataFrames for testing create_per_1m_columns()
test_1m_columns1 = pd.read_csv(io.StringIO("""
Year,White,Black,Asian and Pacific Islander and Other Race,Native American,\
White Pop,Black Pop,American Indian and Alaska Native Pop,Asian and Pacific Islander and Other Race Pop
2000,1,2,0,0,1000000,1000000,20000,2000000
2001,2,4,0,0,2000000,4000000,2000,3000
2002,0,0,500,20,1,1,1000,100
"""))

test_1m_columns1_return = pd.read_csv(io.StringIO("""
Year,White,Black,Asian and Pacific Islander and Other Race,Native American,\
White Pop,Black Pop,American Indian and Alaska Native Pop,Asian and Pacific Islander \
and Other Race Pop,White Per 1M,Black Per 1M,Native Per 1M,Asian and Pacific Islander and Other Per 1M
2000,1,2,0,0,1000000,1000000,20000,2000000,1.0,2.0,0.0,0.0
2001,2,4,0,0,2000000,4000000,2000,3000,1.0,1.0,0.0,0.0
2002,0,0,500,20,1,1,1000,100,0.0,0.0,20000.0,5000000.0
"""))

#DataFrames for testing cumulative_executions_by_state()
cumulative_executions_by_state_test = pd.read_csv(io.StringIO("""
Year,State,Code
2000,Washington,WA
2000,Washington,WA
2000,Oregon,OR
2001,Washington,WA
2001,Oregon,OR
"""))

cumulative_executions_by_state_return = pd.read_csv(io.StringIO("""
Year,State,Cumulative Executions,Code
2000,Oregon,1,OR
2000,Washington,2,WA
2001,Oregon,2,OR
2001,Washington,3,WA
"""))